Go over the syllabus
Introduction to R and RStudio
Installing R and RStudio
Basic R syntax and programming
Hands-on practice
Starting with Lecture 2, classes will generally follow this format:
I will lecture for the first 90 mins (approximately).
Take a 10 min break.
I will assign practice problems so you can practice what was taught in lecture.
Go over the practice problems as a class. Completed problems will be posted as a guide to help you complete future homework.
What is R?
R is the open-source statistical language that seems to have taken over the world of statistics and data science. R is really more than a statistical package: it is a language and environment designed for statistical analysis and the production of high-quality graphics.
R was originally developed by two statisticians at the University of Auckland as a dialect of the S statistical language.
R is both open-source and open development. For more information, see www.r-project.org/contributors.html
Figure 1. A screenshot of what R looks like on macOS. Note: You will never actually work with R directly. You will work with R through RStudio.
Why learn R?
R is a powerful and flexible, free (open source) language designed specifically for statistical computing.
There is an extensive collection of packages created by R users to extend R and implement modern statistical techniques.
Furthermore, R is an interpreted, high level language, which means that we can write code and run it in real time line by line without needing to worry about low level programming such as memory management.
What is RStudio?
RStudio is an integrated development environment [1], or IDE, for R programming, which you can download from https://posit.co/download/rstudio-desktop/.
RStudio helps R users use R effectively by making common tasks easier. One example is shown on the next slide.
RStudio is updated a couple of times a year, and it will automatically let you know when a new version is out, so there’s no need to check back. It’s a good idea to upgrade regularly to take advantage of the latest and greatest features.
[1] IDEs are tools designed to increase programmer productivity by combining common activities of writing software into a single application: editing source code, building executables, and debugging.
Figure 2. A screenshot of what RStudio looks like on macOS.
Both tools are widely used by scientists, academics, data analysts, and data scientists.
According to Glassdoor (as of June 6, 2024):
The median total yearly pay for data analysts in Washington, DC is $107,000.
The median total yearly pay for data scientists in Washington, DC is $183,000.
In my old team at the United States Dept of Agriculture, data analysts and data scientists are currently making $117,962 - $153,354 in 2025.
In my current team at another federal agency, data scientists are currently making $139,395 - $181,216 in 2025.
Warning: This class is an applied data science class. You will get a lot of practice. However, today’s class is full of definitions and terms that you will become familiar with throughout the semester. You don’t necessarily need to memorize most of them. Although you will have some practice today, I see Lecture 2 next week as the real first class, where you actually start coding properly.
Today’s penultimate slide will summarize the key takeaways once we have finished today’s practice exercise.
There are four ‘panes’ or windows in RStudio that we generally use. After your immediate installation, you may only see three (more on this later). But once you start writing and saving R scripts, you will regularly interact with all four panes.
These panes are:
Environment Pane.
Console Pane.
Files Pane.
Source Pane.
Figure 3. This is called the Environment Pane from RStudio which allows users to track which variables or data have been saved into the R environment. More on this later.
Figure 4. This is called the Console Pane from RStudio (Linux Version) which allows users to type in and execute scripts.
Figure 5. Here is an example of a simple script for addition. Type ‘4+2’ then press Enter/return in the Console Pane.
Since we are discussing how to run simple R scripts:
Figure 6. Five basic arithmetic operators you can perform in R.
Note: + is addition; - is subtraction; * is multiplication; / is division; and ** or ^ is exponentiation.
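To try the operators yourself, here is a minimal sketch you can type into the Console Pane one line at a time. The numbers are arbitrary, and the last two lines go slightly beyond the figure to show that R follows the usual order of operations.

```r
4 + 2        # addition: 6
4 - 2        # subtraction: 2
4 * 2        # multiplication: 8
4 / 2        # division: 2
4 ^ 2        # exponentiation: 16
4 ** 2       # same as 4 ^ 2: 16
4 + 2 * 3    # multiplication happens before addition: 10
(4 + 2) * 3  # parentheses change the order: 18
```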
Throughout this course, we will rarely type in and execute/run scripts from the console pane. Generally, you want to save scripts you generate and execute within an R file (more on this later).
Today is one of those exceptions. Generally, we will run scripts in the console pane to install packages. For now, you may think of packages as a collection of tools to increase productivity and do specific tasks (e.g., certain packages can help you create maps). Within the next few slides, we will install packages that you will need for the first four weeks of the course.
An R package is a collection of functions, data, and documentation that extends the capabilities of base R.
Using packages is key to the successful use of R. The majority of the packages that you will learn in this course are part of the so-called tidyverse.
All packages in the tidyverse share a common philosophy of data and R programming and are designed to work together.
Type in then execute (by pressing enter/return) this code within your console pane: install.packages("tidyverse")
You only need to install this once. If you’ve used R previously, it is possible you might have it already.
Once you have tidyverse installed, you need to load the package each time you start a new R session.
More generally, you need to run this script for installing packages: install.packages("[fill in package name]")
To load a package, you need to run/execute this template script:
library([fill in package name])
Note: You do not need quotation marks when loading a package. Again, you need to load the package each time you start a new R session. After loading packages in R, you can use the programming ‘tools’ included within each package to increase your productivity and perform highly specialized tasks.
This may seem trivial for now but you will get a lot of practice throughout the course. This is how you load the tidyverse package into R:
library(tidyverse)
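If you are not sure whether a package is already installed, one possible approach is sketched below. Note that install_if_missing() is a hypothetical helper written for this example, not a function that comes with R or the tidyverse; it relies on the built-in functions requireNamespace() and install.packages().

```r
# Hypothetical helper (not part of R or the tidyverse):
# installs a package only when it is not already available.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with every R installation, so nothing is downloaded here
install_if_missing("stats")
```

For this course, you would call install_if_missing("tidyverse") once, and then still run library(tidyverse) at the start of every new R session.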
Figure 9. A screenshot of the Files pane. We will keep revisiting the Files pane throughout the semester.
Figure 10. Three panes you see upon opening RStudio. Initially, it excludes a fourth pane called source pane.
Figure 11. The fourth pane (source pane) will appear when you create a new file called R Script (or load an existing R file).
For practice (live demo): Click File > New File > R Script. In the file, type one of the scripts you learned (e.g., one of the five basic arithmetic operators). Once you are done, click File > Save As. Name the file whatever you want and save it in a location you can remember. Close RStudio, then try double-clicking the file from the location where you saved it.
Figure 12. What you will see once you have opened a saved file and run the script within it.
Note: To run a script from an R Script file, click anywhere on line 1 (or highlight the code you want to run), and press the ‘Run’ button on the upper right corner of the source pane.
A side note: Although I am teaching you how to create an R Script (i.e., File > New File > R Script), we will be creating and using Quarto notebooks (more on this next week) throughout this semester.
Comments can be used to explain R code, and to make it more readable.
Comments start with a #. When executing code, R ignores everything from the # to the end of that line.
Figure 13. An example of a commented code in R. Important Note: The red ‘Untitled1’ implies this script is unsaved so make sure to always save your scripts.
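A minimal sketch of both ways a comment can appear. The variable name total is made up for this example.

```r
# This whole line is a comment, so R skips it
total <- 4 + 2  # a comment can also follow code on the same line
total           # prints 6
```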
There are data types in R that we will never use. Moreover, this is not a comprehensive programming course in R.
The one data type that we will commonly use and manipulate throughout the semester is called a data frame. A data frame is a data structure made up of rows and columns, similar to a nicely structured Excel spreadsheet or Google Sheets file. I may sometimes refer to this as tabular data.
Figure 14. An example of tabular data in Excel. When loaded into R, this will be read as a data frame.
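To make the idea concrete, here is a small sketch that builds a data frame by hand with the base R function data.frame(). The names and scores are invented for illustration; in this course you will usually load existing data rather than type it in.

```r
# Hypothetical data, made up for illustration
students <- data.frame(
  name  = c("Ana", "Ben", "Cam"),
  score = c(90, 85, 77)
)

students        # print the table: one row per student
nrow(students)  # number of rows: 3
ncol(students)  # number of columns: 2
```

Each argument to data.frame() becomes one column, and every column must have the same number of entries.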
What Is a Function in R? Functions are among the most frequently used objects in R. A function is a piece of executable code that performs a specific task.
library() is an example of a function you were briefly introduced to in earlier slides. It is R code that allows you to load a package. library(tidyverse) uses the library() function to load the tidyverse package into R.
The file we will open and go through for today’s hands-on practice will introduce you to other functions in R that work with data frames.
log() is an R function that takes the logarithm of the number you feed into it. Note: By default, log() computes the natural logarithm, written ln() in mathematics.
# calculates ln(10)
log(10)
## [1] 2.302585
exp() is another R function that computes the exponential value of the number you feed into it. For example, exp(2) is equivalent to calculating \(e^{2}\).
# calculates exp(2)
exp(2)
## [1] 7.389056
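As a quick check on how the two functions relate, the sketch below shows that log() undoes exp(). It also shows two built-in ways to use a base other than e: log10() and the base argument of log(). The variable names are made up for this example.

```r
back_to_two <- log(exp(2))       # log() undoes exp(), giving 2
base_ten    <- log10(1000)       # base-10 logarithm, giving 3
base_two    <- log(8, base = 2)  # logarithm with a base you choose, giving 3

back_to_two
base_ten
base_two
```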
Terminology: Functions such as exp() and log() are functions from what is called base R. Base R refers to built-in tools from the default installation of R.
The focus of this class, however, is the use of functions from packages to perform highly specialized tasks. Lectures 2-4, for example, use tidyverse functions for data visualization and manipulation.
This is a pattern we will generally use throughout the semester:
We generally start our R code by loading the package(s) we need (e.g., library(tidyverse)) at the very beginning.
After loading all the package(s) needed, we will load all the data frame(s) needed into R. In the next few lectures, we will be using what are called built-in data sets from R packages. I will give you the name of the built-in data set (e.g., mpg), then you have to load it into R by running: data([name I will give you]). In Exercise1.R, we ran data(mpg).
Visualize or understand the structure of the data frame (e.g., head(mpg), glimpse(mpg)).
Perform analysis (or create data visualization) using functions.
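The four steps above can be sketched end to end. To keep this sketch runnable without installing anything, it swaps in the datasets package and its iris data set, both of which ship with R, and it uses the base R function str() where the tidyverse version would use glimpse(). In class, the same steps would use library(tidyverse), data(mpg), head(mpg), and glimpse(mpg).

```r
# Step 1: load the package(s) needed
library(datasets)

# Step 2: load the built-in data frame
data(iris)

# Step 3: inspect the structure of the data frame
head(iris)  # first six rows
str(iris)   # column names, types, and a preview of the values

# Step 4: perform a simple analysis
summary(iris)  # summary statistics for every column
```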
Next week: We will be creating basic data visualizations.